Enhance loss_compare.py: Add Import/Export Options and Enable CI Comparison with Existing Losses #2063
Conversation
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0) (oldest at bottom):
* #2063
* __->__ #2062

This will prevent errors when later doing `git checkout`.
maybe put in https://github.com/pytorch/torchtitan/tree/main/tests/assets and just call it llama3_losses.txt?
n00b q: Does this ground-truth loss come from a single-GPU run, or is it FSDP only?
```shell
# Verify the accuracy.
echo "Checking FSDP4 v.s. HSDP2FSDP2TP2 accuracy parity"
echo "Checking FSDP4 v.s. HSDP2FSDP4 accuracy parity"
```
I found HSDP2FSDP4 confusing.
Is this FSDP 8 vs. HSDP (4, 2)?
Yes, we can use HSDP (4, 2). I don't think we have a formal way to write HSDP, or do we, lol?
tianyu-l
left a comment
Please move `assets/ci_llama3_losses.txt` to `tests/assets/llama3_losses.txt`.
If we want to extend this to more models, I would suggest creating a folder, e.g. `tests/assets/losses/llama3.txt`.
```shell
python3 scripts/loss_compare.py . . --baseline-options="${baseline_options}" --test-options="${test_options}" --job-dump-folder="${RUNNER_TEMP}/artifacts-to-be-uploaded/accuracy_comparison_outputs" --assert-equal --baseline-ngpus=4 --test-ngpus=8 --steps=1
export test_options="--parallelism.data_parallel_replicate_degree=4"
python3 scripts/loss_compare.py . . --baseline-options="${baseline_options}" --test-options="${test_options}" --job-dump-folder="${RUNNER_TEMP}/artifacts-to-be-uploaded/accuracy_comparison_outputs" --assert-equal --steps=10 --import-result assets/ci_llama3_losses.txt
rm -rf $RUNNER_TEMP/artifacts-to-be-uploaded/*
```
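As a rough illustration of what `--assert-equal` combined with `--import-result` implies, here is a minimal Python sketch that parses a stored loss file and checks a fresh run against it. The line format, function names, and exact-equality semantics below are assumptions for illustration, not the actual `loss_compare.py` implementation.

```python
# Hypothetical sketch: parse a stored loss file and assert that a fresh
# run reproduces it exactly. The line format ("step: N loss: X") is an
# assumption, not necessarily what loss_compare.py exports.

def parse_losses(text: str) -> dict[int, float]:
    """Parse lines like 'step: 1 loss: 8.1234' into {step: loss}."""
    losses = {}
    for line in text.strip().splitlines():
        parts = line.split()
        losses[int(parts[1])] = float(parts[3])
    return losses

def assert_losses_equal(baseline: dict[int, float], test: dict[int, float]) -> None:
    """Fail loudly if any step's loss differs (exact comparison, as --assert-equal suggests)."""
    assert baseline.keys() == test.keys(), "runs logged different steps"
    for step, ref in baseline.items():
        assert test[step] == ref, f"loss diverged at step {step}: {ref} vs {test[step]}"

if __name__ == "__main__":
    stored = "step: 1 loss: 8.1234\nstep: 2 loss: 7.5678"
    fresh = "step: 1 loss: 8.1234\nstep: 2 loss: 7.5678"
    assert_losses_equal(parse_losses(stored), parse_losses(fresh))
    print("losses match")
```

An exact comparison like this is only meaningful when the runs are expected to be bitwise deterministic; a tolerance-based check would be needed otherwise.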
Is this because you dump something to the folder, so that later runs will complain it's not empty? I think we should make this dump folder optional so that the first run doesn't use it at all.
The reason we need `artifacts-to-be-uploaded` is that torchtitan writes to an output folder, which defaults to `outputs`. Creating `outputs` fails because the file system is read-only, so any TorchTitan job in CI has to redirect its outputs to `artifacts-to-be-uploaded`.
I feel it is too much to make `outputs` optional, because there are several checks in the trainer, and all of this is just for CI. I would rather say the integration tests shouldn't expect the folder to be empty.
Stack from ghstack (oldest at bottom):
This PR allows us to check if the loss is consistent across commits/PRs.